Criterion-Referenced Testing: Issues and Applications

Ronald K. Hambleton and William P. Gorth 
University of Massachusetts 



Over the years, standard procedures for constructing, administering, and
analyzing tests and interpreting scores have become well known to educators.
But recently there have been numerous suggestions for and demonstrations of
instructional models in the schools where the usual procedures for constructing
tests and interpreting test scores are not so useful and in some cases are
completely inappropriate. Examples of these instructional models include
A Model of School Learning (Carroll, 1963, 1970), Individualized Instruction
(Glaser, 1968), and Project PLAN (Flanagan, 1967, 1969). With these models,
tests are being used for the purpose of establishing an individual's
achievement on specified content, i.e., instructional objectives, and of
providing information for making a variety of instructional decisions. Since
traditional norm-referenced tests are clearly inappropriate for these purposes,
we have seen the development of a new kind of testing, criterion-referenced
testing. Criterion-referenced tests are specifically designed to meet the
measurement needs of the new instructional models. The criteria for the
measurements are standards defined when the instructional objectives are
specified. For this reason, the tests are called criterion-referenced.

The term criterion-referenced test was introduced by Glaser (1963) to
make the distinction between tests designed to compare individuals and tests
designed to measure individual achievement relative to some specified domain
of tasks. Of the various definitions proposed for criterion-referenced tests
(Kriewall, 1969; Livingston, 1970; Ivens, 1970), we prefer the definition
proposed by Glaser and Nitko (1971). That is,

    A criterion-referenced test is one that is deliberately
    constructed to yield measurements that are directly
    interpretable in terms of specified performance standards.

According to Glaser and Nitko (1971):

    Performance standards are generally specified by defining
    a class or domain of tasks that should be performed by the
    individual. Measurements are taken on representative
    samples of tasks drawn from this domain, and such measurements
    are referenced directly to this domain for each
    individual measured.

Defining well-specified content domains, developing procedures for
generating appropriate samples of test items, and setting performance standards
represent significant problems for measurement specialists, but they will not
be discussed in this paper. Papers by Millman (1970), Glaser and Nitko (1971),
Hively, Patterson and Page (1968), and Bormuth (1970) have addressed some of
these issues.

Unfortunately, because of their newness and some rather unique problems
to be described later, there is a lack of information on matters such as test
construction procedures and psychometric properties of criterion-referenced
tests. Seldom do even the most recent educational measurement textbooks
include more than one or two pages on the topic. According to Cronbach (1970),
"The testing movement has given too much attention to comparative
interpretations (to individual differences) and too little to absolute,
criterion-referenced measurement." However, the need for such information is
easily seen when one considers the fact that more and more schools each year
are adopting the new instructional models.

This paper will integrate existing information on criterion-referenced
testing with some original research results. It is organized around three
topics: (1) a comparison of norm-referenced and criterion-referenced testing,
(2) item analysis, reliability, and validity of criterion-referenced tests,
and (3) a description of two applications of criterion-referenced testing.

A Comparison of Norm-Referenced and Criterion-Referenced Testing

Norm-Referenced Tests

Almost all of the available aptitude and achievement tests can be classified
as norm-referenced because they are designed to measure individual differences.
The meaning which can be attached to any particular score depends upon a
comparison of that score to some relevant norm distribution. A norm-referenced
test is constructed specifically to maximize the variability of test scores,
since such a test is likely to produce fewer errors in ordering the
individuals on the measured ability. Since norm-referenced tests are often
used for selection purposes, it follows that minimizing the number of
ordering errors is extremely important.

It is well known that norm-referenced tests are constructed using
the traditional item analysis procedures (Gulliksen, 1950; Lord and Novick,
1968). It is partly because of this fact that the test scores cannot be
interpreted relative to some well-defined content domain, since items are
normally selected to produce tests with desired statistical properties rather
than to be representative of some content domain. Very easy and very difficult
test items do not usually appear in norm-referenced tests because they
contribute very little to test score variance. Also, items which do not measure
the same ability as the majority of other items in the test are usually
removed. Empirical evidence to support these conclusions is provided by Cox
(1965). His work revealed that the selection of items from a total item pool by
classical item analysis procedures resulted in tests which contained
proportions of items measuring instructional objectives different from those in
the total item pool.



Criterion-Referenced Tests

The emphasis on mastery learning in the new instructional models has led
to an interest by measurement specialists in criterion-referenced testing.
Criterion-referenced tests can be used to serve two purposes. First, they
can be used to provide very specific information on the performance levels
of individuals on the instructional objectives. This information can be
used, for example, to determine whether an individual has "mastered" particular
objectives (Block, 1971).

Second, criterion-referenced tests can be used to evaluate the
effectiveness of instruction. Norm-referenced tests given at the end of a
course are useless for making evaluative decisions on the effectiveness of
instruction because they are not tailored to the instructional objectives.
However, criterion-referenced tests, combined possibly with the notion of
item-examinee sampling, are useful to the curriculum evaluator because of the
specificity of the results to the instructional objectives (Lord, 1962;
Cronbach, 1963; Shoemaker, 1970a, 1970b; Hambleton, Rovinelli, and Gorth, 1971;
Gorth, Schriber, and O'Reilly, 1971).

What are the appropriate procedures for constructing a criterion-referenced
test? Since a score on a criterion-referenced test is compared to some
performance standard rather than to the performance of other individuals, it
should be clear that the item selection and test construction procedures must
change if the test is to be a good measuring instrument. However, it is only
recently that any attention has been given to the problem (Hively, Patterson
and Page, 1968; Bormuth, 1970; Lindeman, Gorth and Allen, 1969).

Since comparisons among individuals are of little or no interest when using
a criterion-referenced test, it follows that a test constructor is not
usually concerned with developing a test to maximize the variance of test
scores. Therefore, a test developer cannot use classical item analysis
procedures to choose items because they were specifically designed to result in
a test with maximum variance of test scores. For example, criterion-referenced
tests are often used either before students are taught specific instructional
objectives or immediately after students are taught specific instructional
objectives. In the former situation, most students will answer few or none
of the test items correctly, i.e., low total scores, and in the latter
situation, they will answer most or all of the items correctly, i.e., high
total scores. Both situations produce very little variation in total test
scores within the group of students. Consequently, item discrimination indices,
such as the biserial and point-biserial correlation coefficients, will be very
close to zero for most items, which is considered an indication of a poor test
item in classical test theory. However, item statistics based on correlational
methods can be of some use in detecting poor items, given that different
standards are used to interpret the indices. More will be said about this and
other psychometric issues in later sections.
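The behavior of a correlational discrimination index on a mastery-level
post-test can be illustrated with a small sketch. The function name and the toy
response matrix below are hypothetical and are not taken from any study cited
in this paper. On data of this kind the point-biserial index is undefined for
items that every student passes and, for the remaining items, rests on the
responses of only one or two students, so it says little about item quality.

```python
from statistics import mean, pstdev


def point_biserial(item, totals):
    """Point-biserial correlation between a 0/1 item and total test scores.

    Returns None when either variable has zero variance, because the
    correlation is undefined in that case.
    """
    sd_item, sd_total = pstdev(item), pstdev(totals)
    if sd_item == 0 or sd_total == 0:
        return None
    m_item, m_total = mean(item), mean(totals)
    cov = mean((i - m_item) * (t - m_total) for i, t in zip(item, totals))
    return cov / (sd_item * sd_total)


# Hypothetical post-test: 6 students by 4 items, nearly everyone at mastery.
post_test = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
totals = [sum(row) for row in post_test]

for g in range(4):
    item = [row[g] for row in post_test]
    # Print item number, proportion correct, and the discrimination index.
    print(g, mean(item), point_biserial(item, totals))
```

The same function applied to a pre-test on which almost every response is
wrong behaves in the same way, which is why different standards are needed to
interpret such indices for criterion-referenced tests.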

Some measurement specialists have discussed criterion-referenced tests as
ones which would be scalable in a Guttman sense (Popham and Husek, 1969;
Guttman, 1950). In this case, knowing an individual's test score would be
sufficient information to reproduce his response pattern. We would know
precisely which items he answered correctly and incorrectly. While this kind
of test would be excellent for diagnostic purposes, these tests are difficult
to construct (Cox and Graham, 1966).

More typically, the items on a criterion-referenced test can be thought
of as a sample from some well-defined content domain. Knowing a student's
test score does not allow us to say accurately which items were answered
correctly, but we can make a reasonably good estimate of the proportion of
items in the domain that he could answer (Popham and Husek, 1969).
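As a minimal sketch of such an estimate, suppose (as an assumption made only
for illustration) that the test items are a simple random sample from the
domain. The observed proportion correct is then the estimate of the domain
proportion, and the binomial formula gives a rough standard error. The function
name and the example score are hypothetical.

```python
from math import sqrt


def domain_score_estimate(num_correct, num_items):
    """Estimate the proportion of domain items a student could answer.

    Assumes the test items are a simple random sample from the domain, so
    the observed proportion correct is the estimate and the binomial
    formula gives an approximate standard error.
    """
    if num_items <= 0 or not 0 <= num_correct <= num_items:
        raise ValueError("need num_items > 0 and 0 <= num_correct <= num_items")
    p_hat = num_correct / num_items
    std_error = sqrt(p_hat * (1 - p_hat) / num_items)
    return p_hat, std_error


# Hypothetical student: 16 of 20 sampled items answered correctly.
p_hat, se = domain_score_estimate(16, 20)
print(f"estimated domain proportion {p_hat:.2f}, standard error about {se:.2f}")
```

The standard error shrinks as more items are sampled from the domain, which is
one way of seeing why very short tests give imprecise domain estimates.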

It would seem that what is needed now is some test theory developed
specifically for criterion-referenced tests. Some progress has been made in
this direction by Cronbach and Gleser (1965), Kriewall (1969), Glaser and Nitko
(1971), and Hambleton and Novick (1971).

A Summary

Although a test cannot be classified as either a norm-referenced
or criterion-referenced test by simply looking at it, the two kinds of tests
are designed for quite different reasons and constructed using different
procedures. The norm-referenced test is constructed using traditional item
analysis procedures for the purpose of making comparisons among individuals.
In contrast, a criterion-referenced test is designed to facilitate
decision-making relating to individual performance and effectiveness of
instruction. Procedures for constructing the tests are only now being
developed.

It is interesting to note, however, that criterion-referenced tests can
be used to make comparisons among individuals and norm-referenced tests can be
used to measure the extent to which individuals master instructional objectives.
But, since the purposes of criterion-referenced tests and norm-referenced tests
are basically different, one would in most cases be a weak substitute for the
other.

Item Analysis, Reliability, and Validity 

Item Analysis 

Since the traditional approach to item analysis is of limited usefulness
in developing criterion-referenced tests, other procedures needed to be
developed. Three approaches to item analysis of criterion-referenced tests will
be discussed in this section: (1) modification of traditional item analysis
procedures, (2) selecting items to measure change, and (3) item characteristic
curves.



Modification of traditional item analysis procedures. In
criterion-referenced test development the item difficulty index is useful for
selecting "good" items. However, the item difficulty is used somewhat
differently than when one is constructing a norm-referenced test. In that case,
items with moderate difficulty are preferred because they increase the
discriminating power of the test. If such a strategy were employed in
constructing a criterion-referenced test, there is every likelihood that many
of the best items would not be selected. How should the item difficulty index
be used? If the content domain is carefully specified, test items written to
measure accomplishment of the objectives should also be carefully specified and
closely associated with the objectives. Therefore all of the items associated
with the same objective should be answered correctly by about the same
proportion of examinees in a group, i.e., they should have approximately the
same value for the item difficulty index. If an item has a value of the index
quite different from all of the other items, it probably is measuring a
performance which is identifiably different from the objective. If the indices
of the items associated with an objective differ, one of two alternatives may
be followed. Either the items which are least like the objective should be
modified (the item difficulty index would be obtained on the modified items and
compared with the unaltered items for congruency), or the objective should be
written more specifically to refer only to similar items with similar indices.
Thus, the item difficulty index may be used in a new way to refine the items
associated with an objective.
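The screening of items within a single objective might be sketched as
follows. The paper gives no numerical rule for "quite different", so the use of
the median as a reference point and the 0.20 tolerance are assumptions made
here for illustration, as are the function names and the toy responses.

```python
from statistics import median


def item_difficulties(responses):
    """Proportion of examinees answering each item correctly.

    `responses` is a list of rows, one per examinee, of 0/1 item scores.
    """
    n = len(responses)
    return [sum(row[g] for row in responses) / n
            for g in range(len(responses[0]))]


def flag_divergent_items(difficulties, tolerance=0.20):
    """Indices of items whose difficulty differs from the median difficulty
    of the objective's items by more than `tolerance`.

    The median reference and the 0.20 tolerance are illustrative choices.
    """
    center = median(difficulties)
    return [g for g, p in enumerate(difficulties) if abs(p - center) > tolerance]


# Hypothetical responses of 5 examinees to 4 items written for one objective.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
]
difficulties = item_difficulties(responses)
print(difficulties, flag_divergent_items(difficulties))
```

A flagged item is a candidate for revision, or a signal that the objective
should be rewritten more narrowly, in line with the two alternatives described
above.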

Similarly, the item discrimination indices mentioned earlier can be
useful in item analysis for criterion-referenced test construction, although
they were developed specifically for norm-referenced tests. Negative
discrimination indices serve as "warning flags" that items included on a
criterion-referenced test may need modification. (There is also the possibility
that a negative discrimination index is an indication of ineffective teaching
and/or ineffective instructional materials.) The negative value indicates that
students who have generally done best on the total test answered the item
incorrectly more frequently than the students who did poorly on the test. A
positive discrimination index is still meaningful; however, it is more likely
to indicate some shortcoming of the instructional program. This follows since
most of the new instructional programs using criterion-referenced tests are
designed to minimize post-test differences in achievement. (This is done by
individualizing instruction to the extent that variables such as pace, sequence,
and the instructional mode are optimally chosen for each individual.) Items
with zero discrimination may be quite acceptable for criterion-referenced
tests.

Selecting items to measure change. To demonstrate the effectiveness of
instruction, evaluators attempt to construct criterion-referenced tests which
give very different total scores before and after instruction. A number of
researchers have been concerned with item analysis and selection procedures
for constructing these kinds of tests (Cox and Vargas, 1966). An interesting
question concerns whether or not it matters what techniques are used to select
items. That is, given a large pool of test items, how similar would the
selection of items be if different item statistics were used? There is some
evidence from a study by Englehart (1965) to suggest that with norm-referenced
tests there is a high degree of agreement among items selected with various
discrimination indices. Is the situation similar for criterion-referenced
tests?

Cox and Vargas (1966) investigated the effect of employing different item
selection techniques to identify items for norm- and criterion-referenced tests
and the extent to which two methods of item analysis yielded the same relative
evaluation of items. Discrimination indices were computed for items on tests
which had been administered as pre-tests and post-tests in an individualized
instruction program. The first index was the common D statistic (Englehart,
1965) computed for items on the post-test data only. The second index was the
difference in item difficulty between the pre-test and post-test data. (They
also investigated a third index but it is of no interest here.) The results
indicated that some items which are highly desirable for criterion-referenced
tests would be discarded on the basis of their D statistic because they fail to
discriminate between individuals. According to Cox (1970), "The pre- and
post-test method of item analysis produced results sufficiently different from
traditional methods to warrant its consideration in those cases where score
variability is not the concern, such as in criterion-referenced measures."

Using the same methodology but different test items and groups of
examinees, the Cox and Vargas (1966) study was replicated and extended to
provide the results reported below. The test items came from two mathematics
areas, algebra and trigonometry. The algebra test items were administered to
110 11th-grade students at Hopkins High School in Minneapolis, Minnesota.
The trigonometry test items were administered to 102 11th-grade students at
Kailua High School in Kailua, Hawaii. The items were administered to the
students three times: (1) a pre-test, (2) an immediate post-test, and (3) a
delayed post-test about one month after instruction.

The three item statistics considered in the study were r_g, p'_g, and p''_g,
where:

    r_g   = the biserial correlation for item g on the post-test,

    p'_g  = the difference between the proportion of individuals who
            correctly answered item g on the post-test and the pre-test, and

    p''_g = the difference between the proportion of individuals who
            correctly answered item g on the delayed post-test and the
            pre-test.
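The two difference statistics (the change in proportion correct from the
pre-test to the immediate post-test, and from the pre-test to the delayed
post-test) are simple to compute. The sketch below uses hypothetical function
names and a toy data set of four students and three items; it is not the data
from the study.

```python
def proportion_correct(responses):
    """Proportion of examinees answering each item correctly (0/1 rows)."""
    n = len(responses)
    return [sum(row[g] for row in responses) / n
            for g in range(len(responses[0]))]


def difficulty_change(pre, later):
    """Per-item change in proportion correct from `pre` to `later`."""
    return [b - a for a, b in zip(proportion_correct(pre),
                                  proportion_correct(later))]


# Hypothetical responses: 4 students by 3 items on each occasion.
pre_test = [[0, 0, 1], [0, 1, 1], [0, 0, 1], [0, 0, 1]]
post_test = [[1, 1, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]
delayed_post_test = [[1, 1, 1], [1, 0, 1], [1, 0, 1], [0, 1, 1]]

p_prime = difficulty_change(pre_test, post_test)
p_double_prime = difficulty_change(pre_test, delayed_post_test)
print(p_prime, p_double_prime)
```

In this toy example the third item is answered correctly by everyone on every
occasion, so its change is zero: it is insensitive to instruction even though
its post-test difficulty looks like that of a mastered item.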

From Table 1 it is apparent that there is little relationship between
r_g and p'_g or r_g and p''_g for either set of test items. The correlation
between p'_g and p''_g is higher than the other two, but both statistics are
based on the same pre-test data.

Tables 2 and 3 report the similarity of items selected using the three
item statistics for tests made up of 25%, 50%, and 75% of the initial item pool.
It is clear from the results that the choice of statistic has a significant
effect on the final selection of test items.

Table 1

Spearman's Rank Correlations Among
Three Sets of Item Parameters

                         Algebra Test              Trigonometry Test
                    Number of                   Number of
Indices             Items      Correlation      Items      Correlation

r_g and p'_g          57          .38**           75          -.26*
r_g and p''_g         57          .28*            75          -.31**
p'_g and p''_g        57          .78**           75           .68**

*p < .05
**p < .01

Table 2

Percentage of Overlap Between Items Selected According
to Each Pair of Item Analysis Indices
(Algebra Test - 57 Items)

Proportion of the       Baseline:
original item pool      Minimum possible    r_g and    r_g and    p'_g and
selected in the test    overlap             p'_g       p''_g      p''_g

1/4 (14 items)           0%                 35.7%      35.7%      71.4%
1/2 (28 items)           0%                 58.6%      62.0%      79.3%
3/4 (43 items)          66.7%               86.0%      81.4%      86.0%

Table 3

Percentage of Overlap Between Items Selected According
to Each Pair of Item Analysis Indices
(Trigonometry Test - 75 Items)

Proportion of the       Baseline:
original item pool      Minimum possible    r_g and    r_g and    p'_g and
selected in the test    overlap             p'_g       p''_g      p''_g

1/4 (19 items)           0%                 21.0%       5.2%      57.9%
1/2 (37 items)           0%                 39.5%      39.5%      76.3%
3/4 (57 items)          [illegible]         71.9%      71.9%      90.0%

In summary, the differences are not surprising, but their magnitude is.
This emphasizes the importance of choosing the appropriate item statistics to
select items for criterion-referenced tests. Although Cox and Vargas (1966)
endorse the change in item difficulty index as a criterion for item selection,
they do point out "the need for developmental work on item analysis procedures
when only one test administration is possible."
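The overlap comparison reported in Tables 2 and 3 can be sketched in a few
lines. The code below assumes that "overlap" means the percentage of the k
items selected by one statistic that are also selected by the other, with the
k highest-valued items chosen in each case; the paper does not spell out the
computation, so this reading, the function names, and the toy values are
assumptions.

```python
def select_top(index_values, k):
    """Indices of the k items with the largest values of an item statistic."""
    ranked = sorted(range(len(index_values)),
                    key=lambda g: index_values[g], reverse=True)
    return set(ranked[:k])


def overlap_percentage(values_a, values_b, k):
    """Percentage of the k items chosen by statistic A that B also chooses."""
    common = select_top(values_a, k) & select_top(values_b, k)
    return 100.0 * len(common) / k


def minimum_overlap_percentage(pool_size, k):
    """Smallest possible overlap between two k-item selections from a pool."""
    return 100.0 * max(0, 2 * k - pool_size) / k


# Hypothetical values of two item statistics for a pool of 8 items.
biserial = [0.1, 0.5, 0.3, 0.7, 0.2, 0.6, 0.4, 0.0]
change = [0.8, 0.1, 0.6, 0.7, 0.2, 0.3, 0.9, 0.4]

print(overlap_percentage(biserial, change, k=4))
print(minimum_overlap_percentage(pool_size=8, k=4))
print(minimum_overlap_percentage(pool_size=8, k=6))
```

When exactly three quarters of a pool is selected, the minimum possible overlap
works out to two thirds of the selected items, which agrees with the 66.7%
baseline shown in Table 2.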

Item characteristic curve. One of the more interesting suggestions for
item analysis of criterion-referenced tests was made by Wardrop (1970). He
suggested that the item characteristic curve might be a useful alternative to
some of the traditional item analysis procedures.

The notion of an item characteristic curve comes from the work of Lord
(1952, 1968), Birnbaum (1968), and others in the area of latent trait theory.
For the case of a unidimensional test, a latent trait model specifies a
function which relates the probability of success on an item to the underlying
latent trait or ability which the test measures. The choice of different
mathematical forms for the item characteristic curve has led to the development
of different latent trait models (Lord and Novick, 1968). The latent trait
or ability for each individual could be conceptualized as his position on an
ability scale ranging "from no proficiency at all to perfect performance"
(Glaser, 1963). The measurement problem is to place the individual at the
correct location on the ability continuum.

As suggested earlier, various functions have been proposed for the item
characteristic curve. For example, Birnbaum chose a two-parameter logistic
curve,

$$P_g(x) = \left[ 1 + e^{-D a_g (x - b_g)} \right]^{-1},$$

as the form of the item characteristic curve in his model, where P_g(x) is the
probability that an examinee with ability x answers item g correctly. The
parameter b_g is usually referred to as the index of item difficulty, whereas
a_g is referred to as an index of item discrimination. (The constant D is a
scaling factor.) What limited empirical work has been done on various latent
trait models reveals fairly good fits to real data (Ross, 1966; Wright, 1968;
Lord, 1968; Hambleton and Traub, 1970). The assumptions underlying the latent
trait models are discussed by Lord and Novick (1968).
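A minimal sketch of the two-parameter logistic curve follows. The text does
not give a value for the scaling constant D; the value 1.7 used below is the
one conventionally chosen so that the logistic curve approximates the normal
ogive, and it is an assumption here, as are the function name and the example
item parameters.

```python
from math import exp


def icc_two_parameter(x, a_g, b_g, D=1.7):
    """Two-parameter logistic item characteristic curve.

    x   : examinee ability
    a_g : item discrimination
    b_g : item difficulty
    D   : scaling factor (1.7 is a conventional choice, assumed here)
    """
    return 1.0 / (1.0 + exp(-D * a_g * (x - b_g)))


# Probability of success on a hypothetical item (a_g = 1.0, b_g = 0.0)
# at several ability levels.
for ability in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(ability, round(icc_two_parameter(ability, a_g=1.0, b_g=0.0), 3))
```

An examinee whose ability equals the item difficulty has a probability of 0.5
of answering correctly, and a larger discrimination value makes the curve
steeper around that point.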

Why is this such an attractive approach? First, in theory at least, the
item parameters (difficulty and discrimination) remain invariant from group to
group, which is certainly not generally true of traditional item parameters.
For example, the conventional item difficulty, defined as the proportion of
examinees in a group who correctly answer the item, varies as a function of
the ability of the group. The invariance of the item difficulty parameter
would permit the construction of tests with specific characteristics without
prior knowledge of the ability of the examinees. Also, it is theoretically
possible to measure growth using the latent trait ability scale because it is
an interval scale.
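The contrast between the two kinds of item difficulty can be illustrated
numerically. In the sketch below a single hypothetical item with fixed
latent-trait parameters is given to two hypothetical groups; the expected
conventional difficulty is taken to be the average of the two-parameter
logistic curve over each group's abilities. The scaling constant D = 1.7, the
ability values, and the function names are assumptions made for illustration.

```python
from math import exp
from statistics import mean


def icc(x, a_g, b_g, D=1.7):
    """Two-parameter logistic item characteristic curve (D assumed to be 1.7)."""
    return 1.0 / (1.0 + exp(-D * a_g * (x - b_g)))


def expected_conventional_difficulty(abilities, a_g, b_g):
    """Expected proportion correct in a group: the mean of the curve over
    the abilities of the group's members."""
    return mean(icc(x, a_g, b_g) for x in abilities)


# The same item (a_g = 1.0, b_g = 0.0) given to two hypothetical groups.
low_group = [-1.5, -1.0, -0.5, 0.0]
high_group = [0.0, 0.5, 1.0, 1.5]

print(expected_conventional_difficulty(low_group, a_g=1.0, b_g=0.0))
print(expected_conventional_difficulty(high_group, a_g=1.0, b_g=0.0))
```

The latent-trait difficulty b_g is the same in both calls, yet the expected
proportion correct is below 0.5 in the less able group and above 0.5 in the
more able group.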

An important problem to solve before this particular approach to item
analysis and ability estimation becomes practical is the development of an
efficient procedure for estimating item parameters and abilities. Some progress
on the problem has been made by Lord (1968) and by Bock (1971). Another
problem for research concerns the empirical verification of the invariance
property of the item parameters.

Reliability

In many situations where criterion-referenced tests are used there is
little or no test score variance. And, since it is well known that a
reliability coefficient depends, among other things, on the variance of
test scores, it is apparent that the common approaches to estimating reliability
(such as internal consistency and parallel-form) will be of limited usefulness.
As Carver (1970) points out, the reliability of any test depends upon
replicability, but replicability is not dependent upon test score variability.
If a group of examinees all obtained similar scores on parallel forms of some
test, near perfect replicability exists even though test reliability, estimated
using traditional methods, would be close to zero. This rather extreme
example points out the shortcoming of traditional reliability indices and
serves to indicate the need for the development of alternate approaches.

Cox and Graham (1966) report the use of the coefficient of reproducibility
as an alternative to the classical approach to reliability estimation for one
special type of criterion-referenced test. They calculate the coefficient
for a sequentially scaled achievement test designed for use in an instructional
model where performance objectives can be identified as being sequential in
nature. Tests are said to be scalable if, for a particular ordering of items,
individuals are able to answer all questions up to a point and none beyond.
The coefficient of reproducibility is a measure of the extent to which group
performance satisfies this condition. As Cox (1970) says, the "pitfalls of
using reproducibility as a reliability estimate for achievement tests have not
been explored."

Validity

As in the case of reliability, the validity of criterion-referenced
test scores will probably need to be determined by non-correlational techniques.
[illegible] legitimately inferable from those delimited by the criterion. If
techniques for defining content domains and item generation rules such as those
of Hively, Patterson and Page (1968) or Bormuth (1970) are followed, content
validity follows. If other procedures are used, the task of determining
content validity becomes much more difficult.

For determining predictive and construct validity of criterion-referenced
tests, both a non-correlational approach to validation and a suitable
criterion must be found. Cox (1970) has suggested the use of experimental
procedures to establish validity of a criterion-referenced test. For example,
given that teaching is effective, one might determine the construct validity
of a criterion-referenced test by observing the difference in performance
between students who have been exposed to instruction and those who have not.
The bigger the difference, the more valid the test could be said to be.
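One simple way to express such a difference is the gap in mean proportion
correct between an instructed and an uninstructed group. The sketch below is
only one possible index, chosen here for illustration; the function names and
the scores are hypothetical.

```python
from statistics import mean


def mean_proportion_correct(total_scores, num_items):
    """Average proportion of items answered correctly in a group."""
    return mean(score / num_items for score in total_scores)


def instruction_sensitivity(instructed_scores, uninstructed_scores, num_items):
    """Mean proportion correct of the instructed group minus that of the
    uninstructed group; larger values indicate a test that separates the
    two groups more sharply."""
    return (mean_proportion_correct(instructed_scores, num_items)
            - mean_proportion_correct(uninstructed_scores, num_items))


# Hypothetical total scores on a 10-item criterion-referenced test.
instructed = [9, 8, 10, 9]
uninstructed = [2, 3, 1, 2]
print(round(instruction_sensitivity(instructed, uninstructed, num_items=10), 2))
```

The interpretation rests on the premise stated above, namely that the teaching
itself is effective; a small difference could reflect either an insensitive
test or ineffective instruction.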



Some Uses for Criterion-Referenced Testing 
In this final part of the paper we will consider the application of
criterion-referenced tests in the areas of individual assessment and program
evaluation.



Individual Assessment 

One new instructional model is the one used in the Jamesville-DeWitt (JD)
High School in Syracuse, New York State (O'Reilly and Hambleton, 1971) in
the 9th grade science course. It is organized into modules which consist
of a series of instructional activities arranged into a hierarchy of objectives
leading to mastery of a single concept or group of related concepts. The
day-to-day instructional activities which, when taken together, make up a
module are organized into a hierarchy of smaller submodules called learning
activity packages (LAPs). Within each instructional module are four types of
decisions. To provide information for decision-making the following
criterion-referenced tests are administered: a module pretest, a module
posttest, and several LAP pretests and LAP posttests.

Briefly, let us consider each decision separately. As a student begins to
work on a module, a module pretest is administered. Since items in the module
pretest are closely tied to the objectives of all of the LAPs in the module, the
student's correct responses to items measuring the objectives in a LAP would be
used to decide to omit the corresponding LAP from the student's prescribed
activities for the module. Such a procedure will ensure that students
will be working only on learning experiences directed toward goals which
have not been mastered previously. The module posttest, which is either
the same test or a parallel form of the module pretest, can be used for
prescribing remedial work for a student, for grading, and for evaluating
the effectiveness of instruction in the LAPs.

Analogous to the module pretests, the LAP pretests are used to
prescribe a set of objectives within the LAP in which the student must
demonstrate competency before moving on to the next LAP in his prescription.
LAP posttests are used to determine the extent to which students have
satisfactorily completed the objectives of the LAP.
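The placement decision made from the module pretest can be sketched as a
simple rule. The report does not state what level of pretest performance leads
to a LAP being omitted, so the 0.8 proportion used below is a hypothetical
cut-off, and the function name and data layout are likewise assumptions.

```python
def prescribe_laps(pretest_results, mastery_proportion=0.8):
    """Return the LAPs a student still needs, given module pretest results.

    `pretest_results` maps a LAP name to the list of 0/1 scores on the
    pretest items tied to that LAP's objectives. A LAP is omitted from the
    prescription when the proportion correct reaches `mastery_proportion`
    (0.8 is a hypothetical cut-off, not one given in the source).
    """
    prescription = []
    for lap, item_scores in pretest_results.items():
        if sum(item_scores) / len(item_scores) < mastery_proportion:
            prescription.append(lap)
    return prescription


# Hypothetical module pretest results for one student.
student = {
    "LAP 1": [1, 1, 1, 1, 1],
    "LAP 2": [1, 0, 0, 1, 0],
    "LAP 3": [1, 1, 1, 1, 0],
}
print(prescribe_laps(student))
```

The same rule, applied to LAP posttest results, would give a mastery decision
rather than a placement decision; only the source of the scores changes.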

The four decisions just described might conveniently be classified as
either placement or mastery. Decisions relating to the diagnosis of learning
difficulty can also be made from the criterion-referenced tests if the
incorrect responses to the items have been carefully constructed, i.e.,
incorrect choices are included in an item because they are indicative of
particular learning difficulties. Apparently this systematic construction
of distractors for the purpose of diagnosing learning difficulty has not
been carefully explored but offers much potential. In addition to being
an excellent way of extracting more information from a criterion-referenced
test, it offers a systematic way for constructing item alternatives.

One problem that remains to be solved for programs similar in format
to the JD Model is the development of guidelines for establishing cut-off
points (i.e., how many items must an individual pass to demonstrate mastery).
At least one researcher (Nitko, 1971) has suggested different cut-off points
for different individuals.

Another problem found in some of the new instructional models is
the extensive amount of time which is taken up by testing. Although
testing provides data for decision-making and one wants to maximize
the number of correct decisions, it is apparent that the cost in terms
of time is too much to allow tests to be of the length necessary to
ensure low probabilities of error for all types of decisions. Assuming
that tests can be weighted according to their importance, it should
be possible to derive optimum test lengths for the case when the total
testing time is fixed.
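The trade-off between test length, cut-off point, and decision errors can be
explored with a simple binomial sketch. The binomial model (each sampled item
answered correctly with a probability equal to the student's true domain
proportion), the two hypothetical students, and the 80 percent cut-off are all
assumptions introduced here for illustration; the paper does not propose a
particular model.

```python
from math import comb


def prob_pass(true_proportion, num_items, cutoff):
    """Probability that a student scores at least `cutoff` on a test of
    `num_items` items, assuming each item is answered correctly with
    probability `true_proportion` (a binomial model)."""
    return sum(
        comb(num_items, k)
        * true_proportion ** k
        * (1 - true_proportion) ** (num_items - k)
        for k in range(cutoff, num_items + 1)
    )


# Hypothetical students: a non-master who can answer 60% of the domain
# and a master who can answer 90% of it.
for num_items, cutoff in ((5, 4), (10, 8), (20, 16)):
    false_pass = prob_pass(0.60, num_items, cutoff)
    false_fail = 1 - prob_pass(0.90, num_items, cutoff)
    print(f"{num_items:2d} items, pass at {cutoff:2d}: "
          f"false pass {false_pass:.3f}, false fail {false_fail:.3f}")
```

In this toy setting both error probabilities fall as the test is lengthened at
a fixed proportional cut-off, which is the cost in testing time referred to
above.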

While increasing test length is an obvious way of reducing errors in
decision-making, alternate means include tailored testing (Lord, 1969; Ferguson,
1969), differential weighting of response alternatives, and confidence
testing (Wang and Stanley, 1970; Hambleton et al., 1970). All three
approaches can be used with criterion-referenced test items, are based on
intuitively appealing ideas, and offer more information per item on each
examinee. However, there is little empirical data to support any of the
approaches.

Comprehensive Achievement Monitoring (CAM) 

Gorth, Schriber, and O’Reilly (1971) describe a model for the evaluation 
of student achievement '.in classrooms and for curriculum evaluation called 
Comprehensive Achievement Monitoring (CAM). All of the decision-making 
is made on the basis of criterion-referenced test results. The CAM design 
includes the. following components : : ^ X 

1. The definition of a curriculum with behavioral objectives;

2. The writing of test items to measure student performance
on each objective, which are criterion-referenced test items;

3. The organization of a set of randomly parallel tests, where
each test is made up of items measuring all or a sample of all
of the objectives in the curriculum and therefore represents
item sampling;

4. The design of a longitudinal schedule of test occasions,
usually every three or four weeks, throughout the course;

5. The analysis of the test data and the reporting of results 
by computer, usually within a couple of days; 

6. The interpretation of the results by evaluators, teachers,
and students as a means for making better decisions about
their instruction and curriculum; and

7. The modification of curriculum, instructional activities, and
the CAM design based upon the results.
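
Component 3 can be illustrated with a short sketch. It assumes a hypothetical item pool keyed by objective and draws, for each test form, a random sample of items from every objective, so that the forms are randomly parallel in the item-sampling sense. The function name, pool layout, and item labels are invented for the illustration and are not part of the CAM software.

```python
import random

def build_parallel_forms(item_pool, n_forms, items_per_objective, seed=0):
    """Assemble randomly parallel test forms by item sampling.

    item_pool maps an objective label to the list of items written for it
    (a hypothetical layout). Each form samples the same number of items
    from every objective, without replacement within a form.
    """
    rng = random.Random(seed)  # fixed seed so the forms can be reproduced
    forms = []
    for _ in range(n_forms):
        form = []
        for objective in sorted(item_pool):
            form.extend(rng.sample(item_pool[objective], items_per_objective))
        forms.append(form)
    return forms

# Hypothetical pool: three objectives with five items each.
pool = {
    "obj1": ["1a", "1b", "1c", "1d", "1e"],
    "obj2": ["2a", "2b", "2c", "2d", "2e"],
    "obj3": ["3a", "3b", "3c", "3d", "3e"],
}
forms = build_parallel_forms(pool, n_forms=4, items_per_objective=2)
```

Each form here covers every objective. Sampling only a subset of objectives per form, which the design also permits, would need one additional sampling step over the objective labels.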

The CAM methodology has been designed to work well with any grade
level or curricular area. In fact, it has already been used successfully
in more than 20 schools, with more than 15,000 participating students,
and at grade levels from 3rd to 12th and in every academic subject area
(Allen and Gorth, 1971). (Hambleton, Gorth, and O’Reilly [1971] provide
a detailed report on one of the many applications.)



Particularly important to the success of the evaluation is the use of
the computer. It alleviates the frequently encountered bottlenecks of most
evaluations, i.e., the analysis of data and the reporting of results. The computer
allows maximum freedom in the design of the evaluation, a freedom CAM has used
by incorporating longitudinal testing with item sampling.

The information which is provided in the CAM system includes: (1) for
individual students, (a) the total score on the current test and all previous
tests, and (b) information on the correctness of their response to each item
corresponding to course objectives on the current test; and (2) for any
subgroup of students and any set of questions after each test administration,
(a) the achievement level on each objective, and (b) achievement profiles
which display graphically the level of achievement on all objectives on the
previous test occasions.

The computer allows students' achievement to be plotted on any given
objective (or group of objectives) for the entire course. This plot, called
an achievement profile, gives a graphic presentation of the changes in group
achievement throughout the course. Achievement profiles are a unique type
of information available from the CAM model.
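
The computation behind an achievement profile can be sketched as follows. This is a minimal illustration, not the CAM program itself. It assumes that scored responses are available as (test occasion, objective, correct) records, and it reports, for each objective, the proportion of items answered correctly by the group on each occasion. The record layout and names are hypothetical.

```python
from collections import defaultdict

def achievement_profiles(responses):
    """Return {objective: {occasion: proportion correct}} for a group.

    responses is an iterable of (occasion, objective, correct) records,
    a hypothetical layout for scored item responses pooled over students.
    """
    counts = defaultdict(lambda: [0, 0])  # (objective, occasion) -> [right, seen]
    for occasion, objective, correct in responses:
        cell = counts[(objective, occasion)]
        cell[0] += int(correct)
        cell[1] += 1
    profiles = defaultdict(dict)
    for (objective, occasion), (right, seen) in counts.items():
        profiles[objective][occasion] = right / seen
    # Order each profile by test occasion so it can be plotted directly.
    return {obj: dict(sorted(by_occ.items())) for obj, by_occ in profiles.items()}

# Hypothetical records: objective 1 improves from occasion 1 to occasion 2.
records = [
    (1, 1, True), (1, 1, False),
    (2, 1, True), (2, 1, True),
    (1, 2, True), (2, 2, True),
]
profiles = achievement_profiles(records)
```

Because items are sampled across forms, the proportion for an objective on a given occasion pools responses from different students answering different items, which is what allows group achievement to be followed without every student taking every item.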

Figure 1 presents hypothetical achievement profiles for four objectives
from a course. In this example, objective 1 was taught between the first
and second test administrations, objective 3 between the third and fourth
testing, and objective 4 between the fourth and fifth. For the reason given
below, objective 2 was not taught. On the pre-test in the example, all
objectives except number two show achievement at the chance level, or about 20%
on the five-option multiple-choice items. Using the achievement profiles
after the second test administration, the following decisions might be made:
(a) objective 1 was not learned and should probably be retaught in a somewhat
different way; (b) since the performance level on objective 2 was high
on both the first and second test administrations, one could safely skip
instruction on it. After the sixth testing, on the basis of the CAM data,
the following decision could be made: the performance level on objective
3 is slipping and, since it is an important objective, it should be reviewed.
It is also noted that the performance level on objective 1 has not changed.
One might postulate that the topic is just too difficult for this particular
group of students.
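
The kind of reasoning applied to the profiles above can be written down as simple rules. The sketch below is only an illustration: the thresholds (a mastery level of 0.8, a margin of 0.1 above chance, and a drop of 0.15 from the best level reached) and the profile values are assumptions chosen for the example, not values taken from Figure 1 or from the CAM system. The chance level follows from the number of response options, 1/5 = 0.20 for five-option items.

```python
def suggest_action(profile, taught_after, n_options=5,
                   mastery=0.8, margin=0.1, drop=0.15):
    """Suggest an instructional decision from one objective's profile.

    profile      -- proportions correct, one per test administration so far
    taught_after -- number of administrations given before the objective
                    was taught, or None if it has not been taught
    All thresholds are hypothetical defaults for this sketch.
    """
    chance = 1 / n_options  # expected score from guessing alone
    if taught_after is None:
        # High performance before any instruction: instruction can be skipped.
        return "skip instruction" if min(profile) >= mastery else "schedule instruction"
    after = profile[taught_after:]  # administrations following instruction
    if not after:
        return "await next test"
    if max(after) <= chance + margin:
        return "reteach"  # still near the chance level after instruction
    if max(after) - after[-1] >= drop:
        return "review"   # performance is slipping from its best level
    return "no action"

# Hypothetical profiles mirroring the decisions described in the text.
decision_1 = suggest_action([0.20, 0.25], taught_after=1)
decision_2 = suggest_action([0.85, 0.90], taught_after=None)
decision_3 = suggest_action([0.2, 0.2, 0.2, 0.8, 0.7, 0.6], taught_after=3)
```

Rules of this kind only flag candidates for a decision. Whether an objective is important enough to review, or too difficult for a particular group, remains a judgment for the teacher or evaluator.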

In summary, CAM represents an application of criterion-referenced
testing to program evaluation carried out using longitudinal testing
and the notion of item-examinee sampling.



Summary

In this paper we have attempted to highlight some of the special
characteristics of criterion-referenced tests and compare them with norm-
referenced tests. Psychometric considerations involved in constructing a
criterion-referenced test, including item analysis, reliability, and validity,
were mentioned. Also, the application of criterion-referenced testing to
individual assessment and program evaluation was described.



Figure 1. Achievement profiles of a group of students on four objectives across
eight test administrations. (Horizontal axis: test administration.)




Throughout the paper an attempt has been made to indicate some problems
and shortcomings of the current testing methodology. Hopefully the
discussion of these problems will stimulate others to develop the methodology
and models appropriate for criterion-referenced testing, since these problems
must rank among the most pressing in educational measurement.
